Seoul Bike Rent

Authors: Bordas Alexandre and Carval Nicolas


Here is our work for this dataset named "Seoul Bike Sharing Demand".

This dataset contains count of public bikes rented at each hour in Seoul Bike sharing System with the corresponding Weather data and Holidays information.

Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.

Thus we will explore this dataset in order to show the link between the number of bikes rented and the other variables present in the dataset. We will then build a machine learning model to predict, approximately, the number of bikes that could be rented under specific conditions.

You can find the dataset here: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

You'll find:

  • First approach of the data
  • Preprocessing of the data
  • Visualisation
  • Machine Learning

Acknowledgement:

Libraries

In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go

# Interactive visualisation libraries
from ipywidgets import widgets
from ipywidgets import interact


Dataset

In [2]:
df = pd.read_csv("SeoulBikeData.csv", encoding='latin1')
df.head()
Out[2]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object 
dtypes: float64(6), int64(4), object(4)
memory usage: 958.2+ KB

Every column is in the correct format except the Date column; we'll convert it later.

In [97]:
df.describe()
Out[97]:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm)
count 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.000000 8465.00000 8465.000000
mean 729.156999 11.507029 12.771057 58.147194 1.725883 1433.873479 3.944997 0.567868 0.14912 0.077685
std 642.351166 6.920899 12.104375 20.484839 1.034281 609.051229 13.242399 0.868245 1.12554 0.444063
min 2.000000 0.000000 -17.800000 0.000000 0.000000 27.000000 -30.600000 0.000000 0.00000 0.000000
25% 214.000000 6.000000 3.000000 42.000000 0.900000 935.000000 -5.100000 0.000000 0.00000 0.000000
50% 542.000000 12.000000 13.500000 57.000000 1.500000 1690.000000 4.700000 0.010000 0.00000 0.000000
75% 1084.000000 18.000000 22.700000 74.000000 2.300000 2000.000000 15.200000 0.930000 0.00000 0.000000
max 3556.000000 23.000000 39.400000 98.000000 7.400000 2000.000000 27.200000 3.520000 35.00000 8.800000
In [4]:
df.isna().sum()
Out[4]:
Date                         0
Rented Bike Count            0
Hour                         0
Temperature(°C)              0
Humidity(%)                  0
Wind speed (m/s)             0
Visibility (10m)             0
Dew point temperature(°C)    0
Solar Radiation (MJ/m2)      0
Rainfall(mm)                 0
Snowfall (cm)                0
Seasons                      0
Holiday                      0
Functioning Day              0
dtype: int64

No information is missing.

In [217]:
print(df.groupby("Date")[["Hour","Seasons"]].count().shape[0],"Days in the year, which is correct. No days missing")
365 Days in the year, which is correct. No days missing

Looking for outliers:

In [218]:
df[(df['Rented Bike Count']==0)][["Rented Bike Count","Functioning Day"]]
Out[218]:
Rented Bike Count Functioning Day
3144 0 No
3145 0 No
3146 0 No
3147 0 No
3148 0 No
... ... ...
8251 0 No
8252 0 No
8253 0 No
8254 0 No
8255 0 No

295 rows × 2 columns

We see that when the Rented Bike Count equals 0, it corresponds to a non-functioning day.

Maybe by looking at every non-functioning day we can get more information:

In [5]:
df[df["Functioning Day"]=='No'].equals(df[(df['Rented Bike Count']==0)])
Out[5]:
True

In fact there are 13 dates in the year with non-functioning hours, and they account for all the rows without any bike rented. So we'll treat them accordingly.

In [6]:
print("Number of days within the year :",len(df.groupby("Date")))
df[df["Functioning Day"]=='No'].groupby("Date").count()
Number of days within the year : 365
Out[6]:
Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
Date
02/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
03/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
04/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
06/10/2018 7 7 7 7 7 7 7 7 7 7 7 7 7
06/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
09/10/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
09/11/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
10/05/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
11/04/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
18/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
19/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
28/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24
30/09/2018 24 24 24 24 24 24 24 24 24 24 24 24 24

PreProcessing

We keep only functioning days, since we know the count will always be 0 on non-functioning days. We then drop the column.

In [7]:
df = df[df["Functioning Day"]!='No']
In [8]:
df.drop(columns=["Functioning Day"],inplace=True)
df.head()
Out[8]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday
In [223]:
print("Number of days remaining in the year:",len(df.groupby("Date")))
Number of days remaining in the year: 353

We have 353 days left, which corresponds to 365 − 12. There is no mistake: if you look back at the non-functioning days, you'll notice that for 06/10/2018 only 7 hours of that day are non-functioning, so that date survives the filter.
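This counting behaviour can be illustrated on a toy frame, assuming the same filtering logic as above: a date on which only some hours are non-functioning keeps its remaining rows and still counts as a day, while a fully non-functioning date disappears.

```python
import pandas as pd

# Toy frame: d1 has one non-functioning hour, d2 is fully non-functioning
toy = pd.DataFrame({
    "Date": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "Functioning Day": ["Yes", "Yes", "No", "No", "No", "No"],
})
kept = toy[toy["Functioning Day"] != "No"]
# d1 keeps 2 rows and still counts as a day; d2 disappears entirely
n_days = len(kept.groupby("Date"))
print(n_days)  # 1
```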

Now we want to convert the date into the correct format :

In [9]:
df['Date'] =  pd.to_datetime(df['Date'],format="%d/%m/%Y")
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8465 entries, 0 to 8759
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       8465 non-null   datetime64[ns]
 1   Rented Bike Count          8465 non-null   int64         
 2   Hour                       8465 non-null   int64         
 3   Temperature(°C)            8465 non-null   float64       
 4   Humidity(%)                8465 non-null   int64         
 5   Wind speed (m/s)           8465 non-null   float64       
 6   Visibility (10m)           8465 non-null   int64         
 7   Dew point temperature(°C)  8465 non-null   float64       
 8   Solar Radiation (MJ/m2)    8465 non-null   float64       
 9   Rainfall(mm)               8465 non-null   float64       
 10  Snowfall (cm)              8465 non-null   float64       
 11  Seasons                    8465 non-null   object        
 12  Holiday                    8465 non-null   object        
dtypes: datetime64[ns](1), float64(6), int64(4), object(2)
memory usage: 925.9+ KB

Adding a month column in order to analyse more closely later :

In [10]:
df["Month"] = df["Date"].dt.strftime("%B")
df.tail()
Out[10]:
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Month
8755 2018-11-30 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday November
8756 2018-11-30 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday November
8757 2018-11-30 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday November
8758 2018-11-30 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday November
8759 2018-11-30 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday November

We already know what to expect from standard columns such as date, hour and seasons. However, some columns may deserve a separate analysis because their values vary very little, so we look at them more closely:

In [11]:
fig, ax = plt.subplots(figsize=(20,20),nrows=2,ncols=2)

df["Snowfall (cm)"].plot.hist(ax=ax[0,0],grid=True, bins=20, rwidth=0.9, color='#b9d9fa')
df["Rainfall(mm)"].plot.hist(ax=ax[0,1],grid=True,bins=20, rwidth=0.9, color='blue')
df["Visibility (10m)"].plot.hist(ax=ax[1,0],grid=True,bins=20, rwidth=0.9, color='grey')
df["Solar Radiation (MJ/m2)"].plot.hist(ax=ax[1,1],grid=True,bins=20, rwidth=0.9,xlabel="Solar Radiation (MJ/m2)",
                                        color='red')

ax[0,0].set_title("Frequency of Snow fall in cm over the year")
ax[0,0].set_xlabel("Snowfall (cm)")

ax[0,1].set_title("Frequency of Rainfall in mm over the year")
ax[0,1].set_xlabel("Rainfall(mm)")

ax[1,0].set_title("Frequency of Visibility in 10m over the year")
ax[1,0].set_xlabel("Visibility (10m)")

ax[1,1].set_title("Frequency of Solar Radiation in MJ/m2 over the year")
ax[1,1].set_xlabel("Solar Radiation (MJ/m2)")

fig.suptitle('Histograms of different columns', fontsize=20,y=0.92);

Snow and Rain values vary very little, so let's create two new columns converting them into categorical variables.

In [13]:
day = df.groupby("Date").mean()
print("Not snowy :",day[day["Snowfall (cm)"]==0].shape[0]," days")
print("Not rainny :",day[day["Rainfall(mm)"]==0].shape[0], "days")
Not snowy : 326  days
Not rainny : 253 days
In [14]:
df["Snow"] = (df["Snowfall (cm)"]!=0)
df["Rain"] = df["Rainfall(mm)"].apply(lambda x: "No rain" 
                                      if x==0 else ("light rain" if x<3 
                                                    else ("Medium rain" if x<8 
                                                          else "Strong rain")))
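As an aside, an equivalent and arguably more idiomatic way to build such bins is `pd.cut`; this is only a sketch, not the notebook's code (values exactly at the 3 mm or 8 mm boundaries fall in the lower bin here, a negligible difference with the nested lambda above).

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([0.0, 0.5, 2.5, 5.0, 12.0])
labels = ["No rain", "light rain", "Medium rain", "Strong rain"]
# Bins: (-inf, 0] -> No rain, (0, 3] -> light, (3, 8] -> Medium, (8, inf) -> Strong
rain = pd.cut(rainfall, bins=[-np.inf, 0, 3, 8, np.inf], labels=labels)
print(rain.tolist())  # ['No rain', 'light rain', 'light rain', 'Medium rain', 'Strong rain']
```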

Now let's check the correlation between all the variables:

In [15]:
corr= df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(10, 10))
    ax = sns.heatmap(abs(corr), mask=mask, vmax=1, square=True,cmap="Reds")
    ax.set_title("Correlation matrix of the Data set")
In [16]:
corr_matrix = df.corr()[['Rented Bike Count']].sort_values(by = ['Rented Bike Count'], ascending = False).drop(['Rented Bike Count'])
corr_matrix.style.background_gradient(cmap = 'coolwarm').format(precision=2)
Out[16]:
  Rented Bike Count
Temperature(°C) 0.56
Hour 0.43
Dew point temperature(°C) 0.40
Solar Radiation (MJ/m2) 0.27
Visibility (10m) 0.21
Wind speed (m/s) 0.13
Rainfall(mm) -0.13
Snowfall (cm) -0.15
Snow -0.20
Humidity(%) -0.20

For our regression model we decide to drop dew point, as it is too correlated with temperature, and temperature itself correlates better with the rented bike count. We also notice an improvement in the correlation between the snow information and the rented bike count now that it is a boolean (Snow vs Snowfall (cm)).
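The collinearity argument can be double-checked with a quick pairwise correlation. Since the notebook's `df` is not reproduced here, this sketch uses synthetic data on roughly the dataset's scale, where the dew point tracks temperature minus a humidity-dependent offset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(12, 12, size=2000)           # roughly the dataset's temperature scale
dew = temp - 9 + rng.normal(0, 2, size=2000)   # dew point tracks temperature
frame = pd.DataFrame({"Temperature": temp, "Dew point": dew})
r = frame.corr().loc["Temperature", "Dew point"]
print(round(r, 2))  # close to 0.99 -> keep only one of the two predictors
```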

Visualisation

Seasonal plots
In [17]:
fig = px.scatter(df.sort_values(by="Date"),x="Date",y="Rented Bike Count",color="Seasons",
                 title="Scatterplot of Rented Bike Count by every day (including every hour)")
fig.show()

We can clearly see that winter stands out from the other seasons: its values are massed between 0 and 500 bikes per hour, which is low compared to the rest of the year.

We can also see much more volatility as we get closer to summer: values are widely dispersed and can reach 3500 bikes per hour.

In [18]:
temp = df.groupby(["Holiday","Seasons"])["Rented Bike Count"].mean().reset_index()
fig = px.bar(temp,x="Holiday", y="Rented Bike Count", color="Seasons",
             title="Barplot of the mean of rented bike count whether it is holiday or not, separated by seasons")
fig.show()

We decided to plot the mean because the number of holiday days is very low compared to working days.

Thus, we see that holidays don't have much influence on the mean of rented bikes, even if the number rented during holidays is slightly lower in each season.

In [22]:
fig = px.pie(df.groupby("Seasons")["Rented Bike Count"].sum().reset_index(),values="Rented Bike Count",
             names='Seasons',title="Pie Chart representing the proportion of total rented bike by seasons")
fig.show()

As expected from the previous graphs, winter doesn't even represent a tenth of the total bikes rented, while summer dominates with more than a third.

Plots by hour
In [23]:
temp1 = (df[df["Holiday"]=="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]=="No Holiday"]["Rented Bike Count"])).reset_index()
temp2 = (df[df["Holiday"]!="No Holiday"].groupby(["Hour","Holiday"])["Rented Bike Count"].sum()*100/sum(df[df["Holiday"]!="No Holiday"]["Rented Bike Count"])).reset_index()
temp = pd.concat([temp1, temp2])

temp.rename(columns={'Rented Bike Count': '% of rented bike count'}, inplace=True)

fig = px.bar(temp, x="Hour", y="% of rented bike count", color = "Holiday", 
             title='Proportion of rented bike per hour, taking in count if we are in holiday or not',
             color_discrete_sequence=['indianred', 'lightsalmon'])
fig.update_layout(barmode='group')
fig.show()

The holiday bars sum to 100%, and so do the non-holiday bars.

This graph helps us understand the trend underlying holidays: since there are about ten times more non-holiday rows, we only see the effect once the values are normalized as percentages.

Conclusion: both categories follow the same stable pattern, with a drop during the night and a peak in the afternoon. However, on working days 8h and 18h clearly dominate, which is explained by commuting: most workers are travelling to or from work at those hours, which of course isn't the case during holidays.

In [25]:
temp = df.groupby(["Rain","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Rain',title='Line plot of the mean of rented bikes per hour taking in count the Rain')
fig.show()

It doesn't rain much during the year, so we use the mean of rented bikes to interpret this column's trend.

We can clearly see that rain affects the number of rented bikes: when it rains, people don't rent bikes and probably use public transport instead. However, the 8h and 18h peaks are still distinguishable when it rains.

In [27]:
temp = df.groupby(["Seasons","Hour"])["Rented Bike Count"].mean().reset_index()
fig = px.line(temp, x="Hour", y="Rented Bike Count", color='Seasons',title='Line plot of the mean of rented bikes per hour taking in count seasons')
fig.show()

As before, we clearly see the 8h and 18h peaks every day, and every season seems to follow the same pattern except winter. At night, however, more bikes are rented in summer, perhaps because of the higher temperatures and the sun setting later.

Plot of temperature
In [29]:
# temperature
temp = df.groupby(["Temperature(°C)","Seasons"])["Rented Bike Count"].sum().reset_index()
px.scatter(data_frame = temp
           ,x = 'Temperature(°C)'
           ,y = 'Rented Bike Count',color="Seasons",title='Scatter plot of the total number of bikes rented by temperature and season')

We can see that temperature is strongly correlated with rentals: as it rises toward 20 °C, the number of bikes rented generally increases; above 30 °C it drops sharply, as it may be too hot for people to go biking.

In winter, temperatures are low and so is the number of rentals; the other seasons lead to the same conclusion as before.

Plot of visibility
In [31]:
temp = df.groupby("Visibility (10m)")["Rented Bike Count"].mean().reset_index()
fig = px.scatter(temp, x="Visibility (10m)", y="Rented Bike Count",
           hover_name="Rented Bike Count", size_max=60,trendline="ols",trendline_color_override="red", title='Scatter plot of the Rented bike mean by visibility, with ordinal least square trend line')
fig.show()

It is not really helpful; visibility doesn't seem to be an important variable.

Interactive Plots
  • Simple scatterplots showing the correlation of different columns with the Rented Bike Count.
  • Each point corresponds to the mean number of bikes rented for each unique value of the column.
In [32]:
textbox = widgets.Dropdown(
    description='Column:   ',
    value="Temperature(°C)",
    options=["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]
)

X = df.groupby("Temperature(°C)")["Rented Bike Count"].mean().reset_index()["Temperature(°C)"]
Y = df.groupby("Temperature(°C)")["Rented Bike Count"].mean().reset_index()["Rented Bike Count"]
#Assign an empty figure widget with two traces
trace = px.scatter(x=X, y=Y,title="Rented bike mean, by temperature",labels={'x':"Temperature(°C)", 'y':'mean of bikes'})

g = go.FigureWidget(data=trace,
                    layout=go.Layout(
                        title=dict(
                            text='scatter')))

def validate():
    if textbox.value in ["Visibility (10m)","Hour","Temperature(°C)","Humidity(%)","Wind speed (m/s)","Solar Radiation (MJ/m2)"]:
        return True
    else:
        return False


def response(change):
    if validate():                        
        x1 = df.groupby(textbox.value)["Rented Bike Count"].mean().reset_index()[textbox.value]
        y1 = df.groupby(textbox.value)["Rented Bike Count"].mean().reset_index()["Rented Bike Count"]
        with g.batch_update():
            g.data[0].x=x1
            g.data[0].y=y1
            g.layout.title = "Rented bike mean, by "+str(textbox.value)
            g.layout.xaxis.title = textbox.value
            g.layout.yaxis.title = 'mean of Rented bike'

textbox.observe(response, names="value")
In [34]:
container2 = widgets.HBox([textbox])
widgets.VBox([container2,g])

These graphs can help us visualize correlations for the variables we haven't analysed yet.

  • Bar plot comparing the number of bikes rented by hour between two different months
In [35]:
@interact
def view_image(
    col=widgets.Dropdown(
        description="Month #1 :", value="July", options=df["Month"].unique()
    ),
    filtercol=widgets.Dropdown(
        description="Month #2 :", value="August", options=df["Month"].unique()
    ),
):
    temp1 = (df[df["Month"]==col].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp2 = (df[df["Month"]==filtercol].groupby(["Hour","Month"])["Rented Bike Count"].sum()).reset_index()
    temp = pd.concat([temp1, temp2])
    newTitle= "Bar plot of the total number of bike rented by hour between "+col+" and "+filtercol
    fig = px.bar(temp, x="Hour", y="Rented Bike Count", color="Month",title=newTitle).update_layout(barmode="group")
    go.FigureWidget(fig.to_dict()).show()

This graph is really helpful when we want to compare two months side by side.

Machine Learning

The goal of this project is to predict an output when we give a model some parameters. In our case, we want to predict how many bikes are likely to be rented for given parameters (hour, season, ...).

This is why we use machine learning : we will try different models that will predict the output with a certain accuracy. In the end, we aim to seek the best machine learning model that fits our dataset and that gives us the most accurate prediction. We will save it to a Pickle file in order to use it in our flask API.

Here are the different steps :

  • Preprocessing
  • Definition of results storing methods
  • Regressions model :
    • Linear regression
    • Robust regression
    • Ridge regression
    • Random forest
    • SVM regression
  • Model selection
  • Model tuning
  • Model saving
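The last step, saving the chosen model to a Pickle file for the Flask API, can be sketched as follows. The file name and the `RandomForestRegressor` used here are placeholders, not necessarily the final tuned model:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a small placeholder model on toy data
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Persist the fitted model, then reload it as the Flask API would
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

# The reloaded model produces the same predictions
assert (loaded.predict(X_demo) == model.predict(X_demo)).all()
```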

We store each model's results in a common DataFrame (MAE, MSE, RMSE, R² and a cross-validation score) so we can compare them at the end.

Libraries

In [36]:
import sklearn
from sklearn.model_selection import train_test_split

from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

# models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RANSACRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

Data set

In [99]:
X = df[['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)',
        'Solar Radiation (MJ/m2)','Seasons','Holiday','Month','Snow','Rain']]

y = df['Rented Bike Count']
In [100]:
X.head()
Out[100]:
Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Solar Radiation (MJ/m2) Seasons Holiday Month Snow Rain
0 0 -5.2 37 2.2 2000 0.0 Winter No Holiday December False No rain
1 1 -5.5 38 0.8 2000 0.0 Winter No Holiday December False No rain
2 2 -6.0 39 1.0 2000 0.0 Winter No Holiday December False No rain
3 3 -6.2 40 0.9 2000 0.0 Winter No Holiday December False No rain
4 4 -6.0 36 2.3 2000 0.0 Winter No Holiday December False No rain

In order to train the models, we need to dummify the categorical predictors.

Encoding categorical predictors

In [101]:
X["Holiday"] = X["Holiday"].map( {'No Holiday': 0, 'Holiday': 1} ).astype(int)
X["Rain"] = X["Rain"].map( {'No rain': 0, 'light rain': 1, 'Medium rain': 2, 'Strong rain': 3} ).astype(int)
#X["Month"] = X["Month"].map({"January":1, "February":2, "March":3, "April":4, "May":5,
#                             "June":6, "July":7, "August":8, "September":9, "October":10,
#                             "November":11, "December":12}).astype(int)
X["Snow"] = X["Snow"].map( {False: 0, True: 1} ).astype(int)

#We dummify Visibility as we concluded during analysis, same for solar radiation
X['Visibility (10m)']=X['Visibility (10m)'].apply(lambda x: 1 if x==2000 else 0)
X['Solar Radiation (MJ/m2)']=X['Solar Radiation (MJ/m2)'].apply(lambda x:1 if x>=0.5 else 0)

seasons = pd.get_dummies(X.Seasons)
months = pd.get_dummies(X.Month)

X = pd.merge(X, seasons, left_index=True, right_index=True)
X.drop(columns="Seasons",inplace=True)

X = pd.merge(X, months, left_index=True, right_index=True)
X.drop(columns="Month",inplace=True)
C:\Users\Nicolas Carval\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
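The SettingWithCopyWarning appears because `X` was built as a column subset of `df` and may be a view rather than an independent frame. A minimal sketch of the usual fix, on hypothetical data, is to take an explicit copy before mutating:

```python
import pandas as pd

df_demo = pd.DataFrame({"Holiday": ["No Holiday", "Holiday"], "Other": [1, 2]})
X_demo = df_demo[["Holiday"]].copy()   # .copy() detaches X_demo from df_demo
# Mutating a real copy raises no SettingWithCopyWarning
X_demo["Holiday"] = X_demo["Holiday"].map({"No Holiday": 0, "Holiday": 1}).astype(int)
print(X_demo["Holiday"].tolist())  # [0, 1]
```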

In [87]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8465 entries, 0 to 8759
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Hour                     8465 non-null   int64  
 1   Temperature(°C)          8465 non-null   float64
 2   Humidity(%)              8465 non-null   int64  
 3   Wind speed (m/s)         8465 non-null   float64
 4   Visibility (10m)         8465 non-null   int64  
 5   Solar Radiation (MJ/m2)  8465 non-null   int64  
 6   Holiday                  8465 non-null   int32  
 7   Snow                     8465 non-null   int32  
 8   Rain                     8465 non-null   int32  
 9   Autumn                   8465 non-null   uint8  
 10  Spring                   8465 non-null   uint8  
 11  Summer                   8465 non-null   uint8  
 12  Winter                   8465 non-null   uint8  
 13  April                    8465 non-null   uint8  
 14  August                   8465 non-null   uint8  
 15  December                 8465 non-null   uint8  
 16  February                 8465 non-null   uint8  
 17  January                  8465 non-null   uint8  
 18  July                     8465 non-null   uint8  
 19  June                     8465 non-null   uint8  
 20  March                    8465 non-null   uint8  
 21  May                      8465 non-null   uint8  
 22  November                 8465 non-null   uint8  
 23  October                  8465 non-null   uint8  
 24  September                8465 non-null   uint8  
dtypes: float64(2), int32(3), int64(4), uint8(16)
memory usage: 952.4 KB

Splitting in test and train sets

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)

Results storing methods

In [43]:
def cross_val(model):
    # cross_val_score uses the model's default scorer (R² for regressors)
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()

def print_evaluate(true, predicted):  
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')
    
def evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square

Fitting multiple models

Linear regression

In [103]:
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train);
In [104]:
lin_test_pred = lin_reg.predict(X_test)
lin_train_pred = lin_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lin_test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lin_train_pred)
Test set evaluation:
_____________________________________
MAE: 161.5252130592934
MSE: 43064.165586205396
RMSE: 207.51907282513912
R2 Square 0.5332858027607389
__________________________________
Train set evaluation:
_____________________________________
MAE: 154.63385344437665
MSE: 38881.47546619944
RMSE: 197.18386208358797
R2 Square 0.5920502905763476
__________________________________
In [91]:
results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, lin_test_pred) , cross_val(LinearRegression())]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])

Robust Regression

In [92]:
rb_reg = RANSACRegressor(base_estimator=LinearRegression(), max_trials=100)
rb_reg.fit(X_train, y_train)

rb_test_pred = rb_reg.predict(X_test)
rb_train_pred = rb_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rb_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rb_train_pred)
Test set evaluation:
_____________________________________
MAE: 325.62909322831104
MSE: 225958.57864180268
RMSE: 475.35100572293175
R2 Square 0.44761168138006135
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 332.838424315244
MSE: 234942.95299673185
RMSE: 484.70914267912445
R2 Square 0.4325921484087232
__________________________________
In [93]:
results_df_2 = pd.DataFrame(data=[["Robust Regression", *evaluate(y_test, rb_test_pred) , cross_val(RANSACRegressor())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

Ridge Regression

In [49]:
ridge_reg = Ridge(alpha=100, solver='cholesky', tol=0.0001, random_state=42)
ridge_reg.fit(X_train, y_train)

ridge_test_pred = ridge_reg.predict(X_test)
ridge_train_pred = ridge_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, ridge_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, ridge_train_pred)
Test set evaluation:
_____________________________________
MAE: 315.98443721955937
MSE: 177453.30652250646
RMSE: 421.2520700513013
R2 Square 0.5661898113684563
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 322.4911601049207
MSE: 188224.58145730325
RMSE: 433.8485697306184
R2 Square 0.545421115981115
__________________________________
In [50]:
results_df_2 = pd.DataFrame(data=[["Ridge Regression", *evaluate(y_test, ridge_test_pred) , cross_val(Ridge())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append is deprecated
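The `alpha=100` used above controls how strongly Ridge shrinks the coefficients toward zero. A small self-contained sketch (synthetic data, illustrative alphas) showing the shrinkage effect:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in for the bike features
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# The total coefficient magnitude shrinks as alpha grows
norms = []
for alpha in (0.1, 100, 10000):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.abs(coef).sum())
print([round(n, 1) for n in norms])
```

Larger alphas trade variance for bias, which is why tuning it matters for the test-set score.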

Lasso

In [51]:
lasso_reg = Lasso(alpha=0.1, 
              precompute=True, 
#               warm_start=True, 
              positive=True, 
              selection='random',
              random_state=42)
lasso_reg.fit(X_train, y_train)

lasso_test_pred = lasso_reg.predict(X_test)
lasso_train_pred = lasso_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, lasso_test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, lasso_train_pred)
Test set evaluation:
_____________________________________
MAE: 341.8390278337629
MSE: 205953.8702906379
RMSE: 453.8214079245688
R2 Square 0.496516074729515
__________________________________
====================================
Train set evaluation:
_____________________________________
MAE: 346.266402557663
MSE: 214695.85910001665
RMSE: 463.3528451407379
R2 Square 0.4814906571844336
__________________________________
In [52]:
results_df_2 = pd.DataFrame(data=[["Lasso Regression", *evaluate(y_test, lasso_test_pred) , cross_val(Lasso())]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append is deprecated
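Unlike Ridge, Lasso can drive coefficients exactly to zero, effectively performing feature selection. A sketch on synthetic data (the feature counts and `alpha=1.0` are illustrative, not the notebook's settings):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features but only 3 informative: Lasso should zero out most of the rest
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=42)
lasso = Lasso(alpha=1.0, random_state=42).fit(X, y)

n_kept = (lasso.coef_ != 0).sum()
print(n_kept, "non-zero coefficients out of", len(lasso.coef_))
```

On the bike data, checking `lasso_reg.coef_` would reveal which weather variables Lasso considered dispensable.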

Random forest

In [105]:
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)

rf_test_pred = rf_reg.predict(X_test)
rf_train_pred = rf_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_train_pred)
Test set evaluation:
_____________________________________
MAE: 87.7410703043022
MSE: 17222.50207639035
RMSE: 131.23453080797884
R2 Square 0.813348613130709
__________________________________
Train set evaluation:
_____________________________________
MAE: 31.73728356195188
MSE: 2185.0503010119182
RMSE: 46.744521614964874
R2 Square 0.9770741561454173
__________________________________
In [83]:
results_df_2 = pd.DataFrame(data=[["Random Forest Regressor", *evaluate(y_test, rf_test_pred), 0]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append is deprecated
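A side benefit of random forests is the `feature_importances_` attribute, which indicates how much each variable contributed to the splits. A minimal sketch on synthetic data standing in for the bike features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; a higher value means the feature drove more splits
order = np.argsort(rf.feature_importances_)[::-1]
print("features ranked by importance:", order)
```

Applied to `rf_reg`, this would show whether e.g. temperature or hour dominates the rental predictions.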

Support Vector Machine

In [55]:
svm_reg = SVR(kernel='rbf', C=1000000, epsilon=0.001)
svm_reg.fit(X_train, y_train)

svm_test_pred = svm_reg.predict(X_test)
svm_train_pred = svm_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, svm_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, svm_train_pred)
Test set evaluation:
_____________________________________
MAE: 191.98828119563044
MSE: 98668.1227968321
RMSE: 314.11482422329595
R2 Square 0.7587915502888344
__________________________________
Train set evaluation:
_____________________________________
MAE: 187.7769387799055
MSE: 103427.46890952502
RMSE: 321.60141310250026
R2 Square 0.7502135851238172
__________________________________
In [56]:
results_df_2 = pd.DataFrame(data=[["SVM Regressor", *evaluate(y_test, svm_test_pred), 0]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', 'Cross Validation'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append is deprecated
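RBF-kernel SVR is notoriously sensitive to feature scales, which partly explains why such an extreme `C` was needed above. A sketch on synthetic data (the `C=100` and the ×1000 scale factor are illustrative, not the notebook's actual preprocessing):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# One feature on a much larger scale, as with raw weather columns
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=42)
X[:, 0] *= 1000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

raw = SVR(kernel='rbf', C=100).fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(),
                       SVR(kernel='rbf', C=100)).fit(X_tr, y_tr).score(X_te, y_te)
print("R2 raw:", round(raw, 3), "/ R2 scaled:", round(scaled, 3))
```

The unscaled kernel distance is dominated by the large feature, so standardising first typically improves the fit markedly.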

Model selection

In [57]:
results_df[['Model',"RMSE","R2 Square"]].sort_values(by='R2 Square').style.background_gradient(cmap = 'coolwarm')
Out[57]:
  Model RMSE R2 Square
1 Robust Regression 515.800372 0.349602
3 Lasso Regression 453.821408 0.496516
2 Ridge Regression 421.252070 0.566190
0 Linear Regression 420.896029 0.566923
5 SVM Regressor 314.114824 0.758792
4 Random Forest Regressor 225.360985 0.875842
In [74]:
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)

clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_title('RMSE by model')

sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_title('R2 Square by model');

Taking a closer look at the Random Forest

Plot of prediction errors vs predicted values

In [106]:
plt.figure(figsize=(10,6))

plt.scatter(rf_train_pred,rf_train_pred - y_train,
          c = 'black', marker = 'o', s = 35, alpha = 0.5,
          label = 'Train data')
plt.scatter(rf_test_pred,rf_test_pred - y_test,
          c = 'c', marker = 'o', s = 35, alpha = 0.7,
          label = 'Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc = 'upper left')
plt.hlines(y = 0, xmin = 0, xmax = 3500, lw = 2, color = 'red')
plt.title("Scatter plot of the error of predictions")
plt.show()

This plot shows for which kinds of values the model makes its most significant errors. On the train set the model is quite stable: the residuals follow the red line and cluster around 0, with a maximum error of about 500 bikes.

On the test set, however, the error spreads much more for large predictions, and the maximum error now exceeds 1500 bikes. Predictions between 0 and 1500 remain fairly accurate.

Let's check whether the observed errors are normally distributed, i.e. whether the model is sound and only needs tuning.

In [60]:
# sns.displot creates its own figure (leaving plt.figure empty);
# histplot draws on the current figure instead
plt.figure(figsize=(10,6))
sns.histplot(y_test - rf_test_pred)
plt.title("Distribution of the prediction error");

The errors look normally distributed: no outliers, centred on 0. Our model is therefore sound, and we can move on to improving it.
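The visual impression can be backed up numerically: near-zero skewness and excess kurtosis support the "roughly normal" reading. A sketch where `residuals` is a synthetic stand-in for `y_test - rf_test_pred`:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the model residuals (sigma chosen arbitrarily)
rng = np.random.RandomState(0)
residuals = rng.normal(0, 130, size=1000)

# Both statistics are ~0 for a normal distribution
print("skewness:", round(stats.skew(residuals), 2),
      "/ excess kurtosis:", round(stats.kurtosis(residuals), 2))
```

Running the same two statistics on the real residuals would quantify how close to normal they actually are.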

Model tuning

Random gridsearch

In [275]:
n_estimators = [1000]
max_features = ['auto']
max_depth =  [int(x) for x in np.linspace(10,100,10)]
min_samples_split = [2,5,15,30]
min_samples_leaf = [1,2,5,20]

random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf
}

# First create the base model to tune
rf = RandomForestRegressor()

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,scoring='neg_mean_squared_error', 
                               n_iter = 5, cv = 3, random_state=42, n_jobs = 1)

rf_random.fit(X_train,y_train);
In [276]:
rf_random_test_pred = rf_random.predict(X_test)
rf_random_train_pred = rf_random.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf_random_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf_random_train_pred)
Test set evaluation:
_____________________________________
MAE: 144.11846750879187
MSE: 53315.40472354186
RMSE: 230.9012878343078
R2 Square 0.8696628074543465
__________________________________
Train set evaluation:
_____________________________________
MAE: 100.78816187579896
MSE: 26772.530428548504
RMSE: 163.6231353707308
R2 Square 0.9353419892856454
__________________________________
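Once the randomized search has run, the winning configuration can be read off the fitted search object via `best_params_`. A minimal self-contained sketch (synthetic data and a reduced grid stand in for the real `X_train` and `random_grid`):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': [50, 100],
                         'max_depth': [5, 10, None]},
    n_iter=3, cv=3, scoring='neg_mean_squared_error', random_state=42)
search.fit(X, y)

# best_params_ is the winning combination; best_estimator_ is refit on all data
print(search.best_params_)
print("best CV MSE:", round(-search.best_score_, 1))
```

Calling `rf_random.best_params_` in the notebook would show which depth and leaf settings the search above actually selected.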

Random Forest with 1000 trees

In [107]:
rf2_reg = RandomForestRegressor(n_estimators=1000)
rf2_reg.fit(X_train, y_train)

rf2_test_pred = rf2_reg.predict(X_test)
rf2_train_pred = rf2_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf2_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf2_train_pred)
Test set evaluation:
_____________________________________
MAE: 87.7538126967471
MSE: 17190.62269385467
RMSE: 131.11301496744963
R2 Square 0.8136941106046782
__________________________________
Train set evaluation:
_____________________________________
MAE: 31.45397076680909
MSE: 2143.9364745581292
RMSE: 46.302661635786436
R2 Square 0.9775055279839089
__________________________________

Random Forest with 1500 trees

In [278]:
rf3_reg = RandomForestRegressor(n_estimators=1500)
rf3_reg.fit(X_train, y_train)

rf3_test_pred = rf3_reg.predict(X_test)
rf3_train_pred = rf3_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, rf3_test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, rf3_train_pred)
Test set evaluation:
_____________________________________
MAE: 140.5512194225722
MSE: 51572.786940957834
RMSE: 227.09642652617376
R2 Square 0.8739228878314883
__________________________________
Train set evaluation:
_____________________________________
MAE: 51.95985552742616
MSE: 7160.954874487089
RMSE: 84.62242536400791
R2 Square 0.982705665486671
__________________________________

Selecting the final model

In [279]:
results_df = pd.DataFrame(data=[["Random Forest", *evaluate(y_test, rf_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])


results_df_2 = pd.DataFrame(data=[["Random Forest tuned randomly", *evaluate(y_test, rf_random_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)  # DataFrame.append is deprecated

results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1000)", *evaluate(y_test, rf2_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)

results_df_2 = pd.DataFrame(data=[["Random Forest (ntree=1500)", *evaluate(y_test, rf3_test_pred)]], 
                            columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
In [289]:
f, ax = plt.subplots(figsize=(15, 10),nrows=2,ncols=1)

clrs = ['#5499c7' for x in range(len(results_df)) ]
sns.barplot(x="Model", y="RMSE", data=results_df,palette=clrs,ax=ax[0])
mask2 = results_df["RMSE"]==results_df["RMSE"].min()
sns.barplot(x="Model", y=results_df["RMSE"][mask2],data=results_df, color='green',ax=ax[0])
ax[0].set_ylim(200,240)
ax[0].set_title('RMSE by model')

sns.barplot(x="Model", y="R2 Square", data=results_df,palette=clrs,ax=ax[1])
mask2 = results_df["R2 Square"]==results_df["R2 Square"].max()
sns.barplot(x="Model", y=results_df["R2 Square"][mask2],data=results_df, color='green',ax=ax[1])
ax[1].set_ylim(0.6,0.9)
ax[1].set_title('R2 Square by model');

We chose the Random Forest trained with 1500 trees:

In [26]:
final_model = rf3_reg # rf with ntree = 1500

After analyzing the data and removing the irrelevant columns, we tried different kinds of machine learning models to see which one suits our dataset best. In the end, the Random Forest model clearly provides the best results (R² ≈ 0.87 / RMSE ≈ 227) on the test set, which is why we decided to use this model in our Flask API.

After trying to improve this model, we concluded that the number of estimators was the only parameter worth tuning. Increasing it up to 1500 helps, but beyond that level the improvement is no longer significant (the R² increases by only about 0.000001).

Our model is accurate when predicting values under 1500; above that, its predictions no longer match reality. This is likely because the dataset contains too few rows with high target values (above 1500) to train the model correctly on them.

Looking at the MAE metric, our predictions are off by about ±140 bikes on average.

Indeed, looking at the dataset, 75% of the rows correspond to target values (number of bikes rented) under 1085, which is quite a narrow range.

We trained our model on a dataset excluding values above 1500 and the MAE was halved, without losing much R². This shows that with a dataset containing more diverse target values, we would surely obtain better results.

Saving the model

In [ ]:
import pickle
In [28]:
with open("model.pkl","wb") as f:
    pickle.dump(final_model,f)

An example of the usage :

In [29]:
with open("model.pkl","rb") as f:
    clf2 = pickle.load(f)
In [32]:
print("Predicted : ",round(clf2.predict(X_test)[20])," vs  Actual : ",y_test.iloc[20])
Predicted :  2222  vs  Actual :  2267
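A quick way to sanity-check the pickle round trip is to confirm that the restored model reproduces the original's predictions exactly. A sketch on synthetic data (a small forest stands in for `final_model` and `X_test`):

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in for final_model: a small forest on synthetic data
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

blob = pickle.dumps(model)      # same bytes pickle.dump would write to model.pkl
restored = pickle.loads(blob)

# The restored model must reproduce the original predictions exactly
print(np.allclose(model.predict(X), restored.predict(X)))
```

The same check against `model.pkl` would confirm the file served by the Flask API matches the model trained here.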